This section uses read_html(), html_nodes(), and html_text() to scrape web pages, and html_table() to scrape table structures.
Note: Do not download thousands of HTML files from a website to parse; the admins might block you if you send too many requests. Download your file once, in a code chunk separate from the one that manipulates it.
Let’s load the tidyverse:
library(tidyverse)
Before scraping, check the site's robots.txt, which can be accessed directly as URL/robots.txt.
Scraping is okay with the following:
User-agent: *
Disallow:
Not okay with:
User-agent: *
Disallow: /
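The two cases above can be told apart with a few lines of base R. This is a simplified sketch, not a full robots.txt parser: the function name `blocks_all()` is made up for illustration, and it only looks at the `User-agent: *` group and checks for a bare `Disallow: /`.

```r
# Hypothetical helper: TRUE if the "User-agent: *" group disallows the whole site.
# Simplification: path-specific rules, Allow lines, and stacked User-agent lines
# are ignored.
blocks_all <- function(robots_txt) {
  lines <- trimws(strsplit(robots_txt, "\n", fixed = TRUE)[[1]])
  in_star_group <- FALSE
  for (line in lines) {
    if (grepl("^user-agent:", line, ignore.case = TRUE)) {
      # Everything after the first colon is the agent name
      agent <- trimws(sub("^[^:]*:", "", line))
      in_star_group <- identical(agent, "*")
    } else if (in_star_group && grepl("^disallow:", line, ignore.case = TRUE)) {
      path <- trimws(sub("^[^:]*:", "", line))
      if (identical(path, "/")) return(TRUE)
    }
  }
  FALSE
}

okay     <- blocks_all("User-agent: *\nDisallow:")
not_okay <- blocks_all("User-agent: *\nDisallow: /")
```

In practice you would pass in the text of the site's own robots.txt and read any path-specific rules by eye as well.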
In an HTML document, an object can be an element node but also a text node or attribute node.
We have to know a little bit about CSS to understand how to extract individual elements from a website.
CSS is a formatting language based on “selectors” used to control how HTML files should look. Every website is formatted with CSS.
Here is some example CSS:
h3 {
color: red;
font-style: italic;
}
footer div.alert {
display: none;
}
Specifically, the two rules above correspond to HTML like this:
<h3>Some text</h3>
<footer>
<div class="alert">More text</div>
</footer>
- The h3 properties say to make the h3 headers red and in italics.
- <div> tags of class "alert" in the <footer> should be hidden.

We can use CSS selectors to identify the elements of a website we are interested in.
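As a small illustration, the same two selectors can pick elements out of the HTML fragment above. This sketch uses the rvest functions read_html(), html_nodes(), and html_text(), which are covered later in these notes; the variable names are made up for the example.

```r
library(rvest)

# Parse the example HTML from a string instead of a URL
page <- read_html('<html><body><h3>Some text</h3><footer><div class="alert">More text</div></footer></body></html>')

# "h3" matches every <h3> element
h3_text <- html_text(html_nodes(page, "h3"))

# "footer div.alert" matches <div class="alert"> elements inside a <footer>
alert_text <- html_text(html_nodes(page, "footer div.alert"))
```

The selectors only locate elements; the styling rules attached to them in the CSS file play no role when scraping.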
When you scrape an HTML document and convert it into text, you will see some common HTML tags that delineate the nodes in the document, define its internal structure, and convey formatting information:
Tag | Description
--- | ---
`<html>` | At the start and end of an HTML document
`<head>` | Header information
`<title> website title </title>` | Website title
`<body>` | Before and after all the content
`<div> ... </div>` | Divide up page content into sections and apply styles
`<h?> heading </h?>` | Heading (h1 for largest to h6 for smallest)
`<p> paragraph </p>` | Paragraph of text
`<a href="url"> link name </a>` | Anchor with a link to another page or website
`<img src="filename.jpg">` | Show an image
`<ul> <li> list </li> </ul>` | Unordered, bullet-point list
`<b> bold </b>` | Make text between tags bold
`<i> italic </i>` | Make text between tags italic
`<br>` | Line break (force a new line)
`<span style="color:red"> red </span>` | Use CSS style to change text colour

SelectorGadget is an extension for Chrome that allows you to see what CSS selector influences a particular element on a website.
To install SelectorGadget, drag the SelectorGadget link to your bookmarks bar in Chrome, or install it from the Chrome Web Store.
Suppose we wanted to get the top 100 movies of all time from IMDB. The web page is very unstructured:
https://www.imdb.com/list/ls055592025/
If we click on the ranking of The Godfather, the “1” turns green (indicating what we have selected).
Look at the box in the bottom right: “.text-primary” is the selector associated with the “1” we clicked on.
Everything highlighted in yellow also has the “.text-primary” selector associated with it.
We will also want the name of the movie. So if we click on that we get the selector associated with both the rank and the movie name: “a , .text-primary”.
But we also got a lot of stuff we don’t want (in yellow). If we click one of the yellow items we don’t want, it turns red. This indicates we don’t want to select it.
Only the ranking and the name remain, which are under the selector “.lister-item-header a , .text-primary”.
It’s important to visually inspect the selected elements throughout the whole HTML file. SelectorGadget doesn’t always get all of what you want, or it sometimes gets too much.
If you have trouble with SelectorGadget, you can also use the Chrome developer tools.
You can also right-click on an element and select Inspect.
Clicking on the element selector on the top left of the developer tools will show you what selectors are possible with each element.
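The main rvest functions:
- Create an HTML document from a URL, a file on disk, or a string containing HTML with read_html().
- Select parts of a document using CSS selectors: html_nodes(doc, "table td").
- Extract components with html_name() (the name of the tag), html_text() (all text inside the tag), html_attr() (contents of a single attribute) and html_attrs() (all attributes).
- You can also use rvest with XML files: parse with xml(), then extract components using xml_node(), xml_attr(), xml_attrs(), xml_text() and xml_name().
- Parse tables into data frames with html_table().
- Use write_html() or write_xml() to save the HTML data to disk.
- Extract, modify and submit forms with html_form(), set_values() and submit_form().
- Detect and repair encoding problems with guess_encoding() and repair_encoding().

```r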
library(rvest)
```
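To make the extraction functions in the list above concrete, here is a minimal sketch on a one-line HTML string. The snippet and its `class="title-link"` attribute are invented for the example; only the function calls come from rvest.

```r
library(rvest)

# A made-up fragment containing a single link
snippet <- read_html('<p>See <a href="/title/tt0068646/" class="title-link">The Godfather</a></p>')
link <- html_nodes(snippet, "a")

tag_name  <- html_name(link)          # name of the tag
link_text <- html_text(link)          # all text inside the tag
link_href <- html_attr(link, "href")  # contents of a single attribute
all_attrs <- html_attrs(link)         # list with one named character vector per node
```

html_attr() is the usual way to collect link targets once a selector has isolated the anchors you care about.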
Use read_html() to save an HTML file to a variable. The variable will be an “xml_document” object:
html_obj <- read_html("https://www.imdb.com/list/ls055592025/")
html_obj
## {html_document}
## <html xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body id="styleguide-v2" class="fixed">\n <img height="1" widt ...
class(html_obj)
## [1] "xml_document" "xml_node"
rvest will store the HTML file as an XML object.
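Following the earlier note about downloading a page only once, one possible pattern is to keep a local copy and parse that. This is a sketch under assumptions: `read_html_cached()` is a hypothetical helper written for this example, not an rvest function, and the file name in the usage comment is made up.

```r
library(rvest)

# Hypothetical helper: download `url` to `path` only if `path` does not exist yet,
# then parse the local copy. Repeated calls never contact the website again.
read_html_cached <- function(url, path) {
  if (!file.exists(path)) {
    download.file(url, destfile = path, quiet = TRUE)
  }
  read_html(path)
}

# Possible usage (not run here; "imdb_top100.html" is an arbitrary file name):
# html_obj <- read_html_cached("https://www.imdb.com/list/ls055592025/", "imdb_top100.html")
```

Delete the local file when you want a fresh copy of the page.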
We can use html_nodes() and the selectors we found in the previous section to get the elements we want.
Insert the selectors you found as the value of the css argument. This will produce an object of class “xml_nodeset”:
ranking_elements <- html_nodes(html_obj, css = ".lister-item-header a , .text-primary")
head(ranking_elements)
## {xml_nodeset (6)}
## [1] <span class="lister-item-index unbold text-primary">1.</span>
## [2] <a href="/title/tt0068646/?ref_=ttls_li_tt">The Godfather</a>
## [3] <span class="lister-item-index unbold text-primary">2.</span>
## [4] <a href="/title/tt0111161/?ref_=ttls_li_tt">The Shawshank Redemption</a>
## [5] <span class="lister-item-index unbold text-primary">3.</span>
## [6] <a href="/title/tt0108052/?ref_=ttls_li_tt">Schindler's List</a>
To extract the text from the selected elements, use html_text(). This produces a character vector (here of length 200):
ranking_text <- html_text(ranking_elements)
head(ranking_text)
## [1] "1." "The Godfather"
## [3] "2." "The Shawshank Redemption"
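## [5] "3." "Schindler's List"
To tidy this into one row per movie:
- Put the text in a tibble (nrow() is 200 in this example).
- Add a row number with row_number(), flag the even rows (the titles), and number the movies.
- Use pivot_wider() to break out the ranks and movie titles with names_from = iseven and values_from = text.

Voila: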
tibble(text = ranking_text) %>%
  mutate(rownum = row_number(),             # position in the scraped vector
         iseven = rownum %% 2 == 0,         # odd rows are ranks, even rows are titles
         movie = rep(1:100, each = 2)) %>%  # both rows of a pair share one movie id
  select(-rownum) %>%
  # one row per movie; the new columns are named "FALSE" (rank) and "TRUE" (title)
  pivot_wider(names_from = iseven, values_from = text) %>%
  select(-movie, "Rank" = "FALSE", movie = "TRUE") %>%
  mutate(Rank = parse_number(Rank)) ->
  movierank
movierank
## # A tibble: 100 x 2
## Rank movie
## <dbl> <chr>
## 1 1 The Godfather
## 2 2 The Shawshank Redemption
## 3 3 Schindler's List
## 4 4 Raging Bull
## 5 5 Casablanca
## 6 6 Citizen Kane
## 7 7 Gone with the Wind
## 8 8 The Wizard of Oz
## 9 9 One Flew Over the Cuckoo's Nest
## 10 10 Lawrence of Arabia
## # … with 90 more rows
Let’s try to get the name, rank, year, genre, and metascore for each movie:
We already have html_obj, so we don’t need to scrape again.
Copy the CSS selectors and use html_nodes() to create a text vector:
dataobj <- html_nodes(html_obj,
                      css = ".favorable , .genre, .unbold, .lister-item-header a")
datatext <- html_text(dataobj)
We have a lot of cleaning to do. Note that we didn’t even want the first 132 elements we got:
#view(datatext)
length(datatext)
## [1] 628
datatext[131:136]
## [1] "\n Suicide\n (17)\n "
## [2] "\n 1930s\n (16)\n "
## [3] "1."
## [4] "The Godfather"
## [5] "(1972)"
## [6] "\nCrime, Drama "
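length(datatext) - 132 # 100 movies x 5 fields = 500, so we are missing four elements somewhere
## [1] 496
The rank of each movie is a number followed by a period, so the regex "^\\d+\\.$" flags the rows where a new movie starts.
Take the cumulative sum of that flag with cumsum() (instead of the rep() we used before) to figure out to which movies the elements belong.
Use this to filter out the initial (pre-rankings) rows, which are all 0, and assign back to the data frame.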
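Before running this on the scraped text, it may help to see the flag-then-cumsum() idea on a toy vector. The values below are invented stand-ins (two junk rows followed by two short "movies"), and the names `toy` and `toy_numbered` are made up for the example.

```r
library(dplyr)
library(stringr)

# Made-up stand-in for the scraped text
toy <- tibble(text = c("junk", "more junk", "1.", "Movie A", "(1999)", "2.", "Movie B"))

toy_numbered <- toy %>%
  mutate(isrank = str_detect(text, "^\\d+\\.$"),  # TRUE on rows like "1." and "2."
         movienum = cumsum(isrank)) %>%           # counts how many ranks were seen so far
  filter(movienum > 0)                            # rows before the first rank have 0

toy_numbered$movienum
```

Every row after a rank inherits that rank's running count, so the approach still works when a movie has fewer elements than the others. The same two steps are applied to the scraped data next.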
datadf <- tibble(text = datatext)
datadf %>%
mutate(ismovierank = str_detect(text, "^\\d+\\.$")) ->
datadf
#view(datadf)
#
## Check to make sure you have 100 ranks
sum(datadf$ismovierank)
## [1] 100
## get movie numbers and remove non-movie elements:
datadf %>%
mutate(movienum = cumsum(ismovierank)) %>%
filter(movienum > 0) ->
datadf
datadf
## # A tibble: 496 x 3
## text ismovierank movienum
## <chr> <lgl> <int>
## 1 "1." TRUE 1
## 2 "The Godfather" FALSE 1
## 3 "(1972)" FALSE 1
## 4 "\nCrime, Drama " FALSE 1
## 5 "100 " FALSE 1
## 6 "2." TRUE 2
## 7 "The Shawshank Redemption" FALSE 2
## 8 "(1994)" FALSE 2
## 9 "\nDrama " FALSE 2
## 10 "80 " FALSE 2
## # … with 486 more rows
Create a logical variable like ismovierank for each kind of data element to identify which rows hold it.
Name: we can use the movierank$movie variable we created before to see which rows are movie names:
datadf %>%
mutate(isname = text %in% movierank$movie) ->
datadf
## make sure we have 100 movies:
sum(datadf$isname)
## [1] 100
datadf
## # A tibble: 496 x 4
## text ismovierank movienum isname
## <chr> <lgl> <int> <lgl>
## 1 "1." TRUE 1 FALSE
## 2 "The Godfather" FALSE 1 TRUE
## 3 "(1972)" FALSE 1 FALSE
## 4 "\nCrime, Drama " FALSE 1 FALSE
## 5 "100 " FALSE 1 FALSE
## 6 "2." TRUE 2 FALSE
## 7 "The Shawshank Redemption" FALSE 2 TRUE
## 8 "(1994)" FALSE 2 FALSE
## 9 "\nDrama " FALSE 2 FALSE
## 10 "80 " FALSE 2 FALSE
## # … with 486 more rows
Year: the years are surrounded by parentheses, so we can use regex to add a variable that flags which rows are years:
datadf %>%
mutate(isyear = str_detect(text, "\\(\\d+\\)")) ->
datadf
## make sure it is 100
sum(datadf$isyear)
## [1] 100
datadf
## # A tibble: 496 x 5
## text ismovierank movienum isname isyear
## <chr> <lgl> <int> <lgl> <lgl>
## 1 "1." TRUE 1 FALSE FALSE
## 2 "The Godfather" FALSE 1 TRUE FALSE
## 3 "(1972)" FALSE 1 FALSE TRUE
## 4 "\nCrime, Drama " FALSE 1 FALSE FALSE
## 5 "100 " FALSE 1 FALSE FALSE
## 6 "2." TRUE 2 FALSE FALSE
## 7 "The Shawshank Redemption" FALSE 2 TRUE FALSE
## 8 "(1994)" FALSE 2 FALSE TRUE
## 9 "\nDrama " FALSE 2 FALSE FALSE
## 10 "80 " FALSE 2 FALSE FALSE
## # … with 486 more rows
Genre: each genre begins with a newline character, so again we can use regex to identify those rows:
datadf %>%
mutate(isgenre = str_detect(text, "^\\n")) ->
datadf
## make sure it is 100
sum(datadf$isgenre)
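## [1] 100
Metascore: these are the rows that match none of the flags so far. Counting the flag combinations confirms it:
datadf %>%
  group_by(ismovierank, isname, isyear, isgenre) %>%
  count() # only 96 rows match no flag: we are missing four metascores, as we suspected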
## # A tibble: 5 x 5
## # Groups: ismovierank, isname, isyear, isgenre [5]
## ismovierank isname isyear isgenre n
## <lgl> <lgl> <lgl> <lgl> <int>
## 1 FALSE FALSE FALSE FALSE 96
## 2 FALSE FALSE FALSE TRUE 100
## 3 FALSE FALSE TRUE FALSE 100
## 4 FALSE TRUE FALSE FALSE 100
## 5 TRUE FALSE FALSE FALSE 100
datadf %>%
mutate(ismeta = !ismovierank & !isname & !isyear & !isgenre) ->
datadf
datadf
## # A tibble: 496 x 7
## text ismovierank movienum isname isyear isgenre ismeta
## <chr> <lgl> <int> <lgl> <lgl> <lgl> <lgl>
## 1 "1." TRUE 1 FALSE FALSE FALSE FALSE
## 2 "The Godfather" FALSE 1 TRUE FALSE FALSE FALSE
## 3 "(1972)" FALSE 1 FALSE TRUE FALSE FALSE
## 4 "\nCrime, Drama … FALSE 1 FALSE FALSE TRUE FALSE
## 5 "100 " FALSE 1 FALSE FALSE FALSE TRUE
## 6 "2." TRUE 2 FALSE FALSE FALSE FALSE
## 7 "The Shawshank Redemption" FALSE 2 TRUE FALSE FALSE FALSE
## 8 "(1994)" FALSE 2 FALSE TRUE FALSE FALSE
## 9 "\nDrama " FALSE 2 FALSE FALSE TRUE FALSE
## 10 "80 " FALSE 2 FALSE FALSE FALSE TRUE
## # … with 486 more rows
Let’s create a key variable for the data in text using dplyr::case_when() and then use pivot_wider() to spread them:
datadf %>%
mutate(key = case_when(ismovierank ~ "rank",
isname ~ "name",
isyear ~ "year",
isgenre ~ "genre",
ismeta ~ "metacritic")) %>%
select(key, text, movienum) %>%
# spread(key = "key", value = "text") ->
pivot_wider(names_from = key, values_from = text) ->
datawide
datawide
## # A tibble: 100 x 6
## movienum rank name year genre metacritic
## <int> <chr> <chr> <chr> <chr> <chr>
## 1 1 1. The Godfather (1972) "\nCrime, Drama … "100 …
## 2 2 2. The Shawshank Red… (1994) "\nDrama " "80 …
## 3 3 3. Schindler's List (1993) "\nBiography, Drama, Hi… "94 …
## 4 4 4. Raging Bull (1980) "\nBiography, Drama, Sp… "89 …
## 5 5 5. Casablanca (1942) "\nDrama, Romance, War … "100 …
## 6 6 6. Citizen Kane (1941) "\nDrama, Mystery … "100 …
## 7 7 7. Gone with the Wind (1939) "\nDrama, History, Roma… "97 …
## 8 8 8. The Wizard of Oz (1939) "\nAdventure, Family, F… "100 …
## 9 9 9. One Flew Over the… (1975) "\nDrama " "83 …
## 10 10 10. Lawrence of Arabia (1962) "\nAdventure, Biography… "100 …
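## # … with 90 more rows
Now clean up the columns:
- Remove the newline characters from genre with str_replace_all().
- Trim the extra whitespace with str_squish().
- Convert rank, year, and metacritic to numbers with parse_number().
- Remove movienum as it is no longer needed.

Reassign back to the data frame.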
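The cleaning functions can be tried on single strings first. The values below mimic the scraped text for The Godfather shown earlier; the exact amount of trailing whitespace is an assumption, since the printed output truncates it.

```r
library(stringr)
library(readr)

raw_genre <- "\nCrime, Drama            "   # assumed whitespace padding

# Drop the newline, then collapse and trim the remaining whitespace
clean_genre <- str_squish(str_replace_all(raw_genre, "\\n", ""))

# parse_number() keeps the first number it finds and drops the surrounding characters
rank_num <- parse_number("1.")
year_num <- parse_number("(1972)")
meta_num <- parse_number("100        ")
```

str_squish() on its own would also remove the newline, because it treats it as whitespace; the explicit str_replace_all() step simply makes the intent visible.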
datawide %>%
mutate(genre = str_replace_all(genre, "\\n", ""),
genre = str_squish(genre),
metacritic = parse_number(metacritic),
rank = parse_number(rank),
year = parse_number(year),
movienum=NULL) ->
datawide
datawide
## # A tibble: 100 x 5
## rank name year genre metacritic
## <dbl> <chr> <dbl> <chr> <dbl>
## 1 1 The Godfather 1972 Crime, Drama 100
## 2 2 The Shawshank Redemption 1994 Drama 80
## 3 3 Schindler's List 1993 Biography, Drama, Histo… 94
## 4 4 Raging Bull 1980 Biography, Drama, Sport 89
## 5 5 Casablanca 1942 Drama, Romance, War 100
## 6 6 Citizen Kane 1941 Drama, Mystery 100
## 7 7 Gone with the Wind 1939 Drama, History, Romance 97
## 8 8 The Wizard of Oz 1939 Adventure, Family, Fant… 100
## 9 9 One Flew Over the Cuckoo's N… 1975 Drama 83
## 10 10 Lawrence of Arabia 1962 Adventure, Biography, D… 100
## # … with 90 more rows
html_table()
When data is in the form of a table, you can format it more easily with html_table().
The Wikipedia article on hurricanes: https://en.wikipedia.org/wiki/Atlantic_hurricane_season
This contains many tables which might be a pain to copy and paste into Excel (and we would be prone to error if we tried).
html_table() makes a few assumptions:
- No cells span multiple rows.
- Headers are in the first row.

Save the website HTML once using read_html():
wikixml <- read_html("https://en.wikipedia.org/wiki/Atlantic_hurricane_season")
We’ll extract all of the “table” elements.
wikidat <- html_nodes(wikixml, "table")
Use html_table() to get a list of tables from the table elements:
# fill = TRUE pads rows that have too few cells with NA. Without it, older rvest
# versions stop with "Table has inconsistent number of columns"; newer versions
# fill automatically, so the argument can be dropped there.
tablist <- html_table(wikidat, fill = TRUE)
class(tablist)
length(tablist)
# The position of the table you want may change as the article is edited
tablist[[19]] %>%
  select(1:4)
You can clean up, bind, or merge these tables after you have read them in.
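For a version that runs without a network connection, here is a minimal sketch on an inline page with two tables. The table contents are invented placeholders, and the "table.wikitable" selector is the kind of class-based selector you would find with the developer tools.

```r
library(rvest)

# Made-up page: one data table with class "wikitable" and one layout-only table
toy_page <- read_html('<table class="wikitable">
  <tr><th>Storm</th><th>Year</th></tr>
  <tr><td>Alpha</td><td>2001</td></tr>
  <tr><td>Beta</td><td>2002</td></tr>
</table>
<table><tr><td>layout only</td></tr></table>')

# Selecting by class skips the layout table
toy_nodes  <- html_nodes(toy_page, "table.wikitable")
toy_tables <- html_table(toy_nodes)   # a list with one data frame per matched table

first_table <- toy_tables[[1]]
```

Narrowing the selector before calling html_table() keeps the resulting list short, which makes it easier to find the table you want by position.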
Exercise: scrape the table data from <https://en.wikipedia.org/wiki/List_of_the_oldest_mosques>.
Hint: It’s easier if you use a css selector of "table.wikitable" to get the table rather than just "table". Get to the developer tools in Chrome and play around with the tables.
The first 15 rows should look like this when you are done:
## Building Country fb category
## 1 Al-Haram Mosque Saudi Arabia <NA> Mentioned in the Quran
## 2 Al-Aqsa Mosque Palestine <NA> Mentioned in the Quran
## 3 The Sacred Monument Saudi Arabia <NA> Mentioned in the Quran
## 4 Quba Mosque Saudi Arabia 622 Mentioned in the Quran
## 5 Mosque of the Companions Eritrea 610 Northeast Africa
## 6 Negash Āmedīn Mesgīd Ethiopia 620 Northeast Africa
## 7 Masjid al-Qiblatayn Somalia 620 Northeast Africa
## 8 Korijib Masjid Djibouti 630 Northeast Africa
## 9 Mosque of Amr ibn al-As Egypt 641 Northeast Africa
## 10 Mosque of Ibn Tulun Egypt 879 Northeast Africa
## 11 Al-Hakim Mosque Egypt 928 Northeast Africa
## 12 Al-Azhar Mosque Egypt 972 Northeast Africa
## 13 Arba'a Rukun Mosque Somalia 1268 Northeast Africa
## 14 Fakr ad-Din Mosque Somalia 1269 Northeast Africa
## 15 Great Mosque of Kairouan Tunisia 670 Northwest Africa